6.0	Computer Vision Theory.

CONTENTS:

	6.0	Introduction to Computer Vision Theory.
	6.1	Vision Systems.
	6.2	Vision Tasks.
	6.3	The Nature of Images.
	6.4	The Nature of Worlds.
	6.5	Locus Solving.
	6.6	Comparing.
	6.7	Mobile Robot Vision.
	6.8	Related Work.
	6.9	Summary.

BOXES:

	6.1	Three basic modes of vision.
	6.2	Vision Mandala.
	6.3	Some basic kinds of vision processors.
	6.4	Table of 3-D computer vision tasks.
	6.5	Alternatives to 3-D geometric modeling.
	6.6	Three kinds of Locus Solving.
	6.7	Chauffeur Cart Task Solution.

FIGURES:

	6.1	Vision System Hierarchy
	6.2	Cart Vision Mandala

{[C;<N;αVISION THEORY.;
λ30;P60;I425,0;JCFA             SECTION 6.
{JCFD                    COMPUTER VISION THEORY.

{I950,0;FA
[6.0	Introduction to Computer Vision Theory.]
{JU}
	Computer vision concerns programming a computer  to do a task
that  demands the  use of  an image  forming light  sensor such  as a
television camera.  The theory I intend to elaborate is  that general
3-D  vision is  a continuous  process of  keeping an  internal visual
simulator  in sync with perceived images  of the external reality  so
that vision tasks can be  done by reference to the  simulator's model
rather than  by reference to the original  images. The word <theory>,
as used  here,    means  simply a  set  of  statements  presenting  a
systematic view  of a subject.  Specifically, I  wish to exclude  the
connotations  that the theory is  a mathematical theory  or a natural
theory.  Perhaps there can be such a thing as an <artificial theory>
which extends from philosophy through the design of an artifact.
{Q}
[6.1	Vision Systems.]

	Vision systems mediate between images and world models; these
two extremes of a vision system are called, in the jargon, the
<bottom> and the <top> respectively.  In what follows, the word
<image> will be used only to refer to the notion of a 2-D data
structure  representing a  picture.  A picture is  a rectangle
taken from the pattern of  light formed by a thin lens on  the nearly
flat  photoelectric  surface of  a  television  camera's vidicon.  A
sequence of images in time will be called a film. On the other  hand,
a <world model>  is a data  structure which is supposed  to represent
the physical world for the purposes of  a task processor.
In particular,  the main point of this thesis concerns isolating a portion
of  the world model (called  the 3-D geometric world model) and
placing it below most of the other entities that a task processor has
to deal with.  The vision hierarchy, so formed, is illustrated in
figure (6.1).

{|λ10;JA}
FIGURE 6.1 {JC} Vision System Hierarchy.

{JC} Task Processor
{JC} |
{JC} Task world model
{JC} |
		 The  Top   → {JC} 3-D geometric model
{JC} |
		 The Bottom → {JC} 2-D images

{|λ30;JU}

	Considering the  gap  between images  and world  models,   it
seems that a general vision system has three distinguishable modes of
operation: recognition,  verification  and  description.  Recognition
vision can  be characterized as bottom  up random access. What  is in
the  picture is determined by  extracting a set  of features from the
image and by classifying them according to a statistical key.
Verification  vision, also called  top down  or model  driven vision,
involves predicting an  image,  followed  by comparing the  predicted
image and a  perceived image for  differences which are  expected but
not yet measured.


	Descriptive vision, also called bottom up or data driven
vision, involves converting the image into a different representation
which makes it possible (or easier) to do the desired vision  task.
I would like to call this kind  of vision "revelation vision" at times;
although  I acknowledge that  the phrase "descriptive  vision" is the
term used by most members of the computer vision community.

{|λ10;JU;FA
Box 6.1 {JC} Three Basic Modes of Vision.

	1. Recognition Vision - feature classification. (bottom up into an existing top).
	2. Verification Vision - Model Driven Vision. (nearly pure top down vision).
	3. Descriptive Vision - Data Driven Vision. (nearly pure bottom up vision).
{|λ30;JU}

	Now we have enough pieces to outline the bare system design.
By placing a 3-D geometric model in the gap, recognition vision can
be done on 3-D (rather than 2-D) features into the task world model,
and descriptive vision and verification vision can be used to link
the 2-D and 3-D models in a relatively dumb, mechanical fashion.
Previous attempts to use recognition vision,  to  bridge directly the
large  gap between 2-D  images (of  3-D objects)  and the  task world
model, have  been frustrated  because  the characteristic  2-D  image
features of  a  3-D object  are very  dependent on  the 3-D  physical
processes  of occultation,  rotation  and illumination.   It is these
processes that  will have  to be  modeled and  understood before  the
features  relevant to  the task  processor  can be  deduced from  the
perceived  images.  The arrangement  of  these elements  is diagramed
below.

{|λ10;JA}
Box 6.2 {JC} A Simple Vision System Mandala.

{JC} Task World Model
{JC} ↑
{JC} RECOGNITION
{JC} ↑
{JC} 3-D geometric model
{JC} ↑            ↓
{JC} DESCRIPTION        VERIFICATION
{JC} ↑            ↓
{JC} 2-D images
{|λ30;JU}

	I wish  to  call attention  to the  lower part  of the  above
mandala  (a <mandala> is any  circle-like  system diagram);  this
portion is  the core  of the  3-D geometric  modeling vision  system.
Depending on circumstances,  the vision  system should be able to run
almost   entirely   top-down  (verification   vision)   or  bottom-up
(revelation vision). Verification vision is all that is required in a
well known, predictable environment; whereas revelation vision is
required  in  a   brand  new  (tabula   rasa)  or  rapidly   changing
environment.    Thus   revelation  and  verification  form   a  loop,
bottom-up and top down. First,  there is some kind of revelation that
forms (or selects) a 3-D model; and second,  the model is verified by
testing image features predicted from the assumed model.  This
loop-like structure has been noted before by others; it is a form of what
Tenenbaum(71) called <accommodation> and a form of what Falk(69)
called <heuristic vision>; however, I will go along with what I think
is the current majority of vision workers, who call it <visual feedback>.

	Completing the design, the images and worlds are
constructed, manipulated and compared by a variety of processors, the
topmost of which is the task processor.  Since the task processor is
expected to vary with the application, it would be expedient if it
could be isolated as a user program calling on utility routines
of an appropriate vision sub-system.  Immediately below the task
processor are the 3-D recognition routines and the 3-D modeling
routines.  The modeling routines underlie nearly everything, because
they are used to create, alter and access the models.

~|-----------------------------------------------------------λ10;JAFA
Box 6.3	Basic Kinds of Vision Processors in a 3-D Vision System.
	0. The task processor.
	1. 3-D recognition.
	2. 3-D modeling  routines.
	3. Reality simulator.
	4. Image analyser.
	5. Image synthesizer - hidden line eliminator.
	6. Locus solvers: camera, sun and object.
	7. Comparators: 2D and 3D.
~|-----------------------------------------------------------λ30;JUFA

	The remaining processors include the reality simulator, which
does Newtonian mechanics for modeling motion, collision and gravity.
Also there  are image  analyzers,   which  do image  enhancement  and
conversions  such as  converting  video rasters  into line  drawings.
There  is an  image synthesizer, which  does hidden  line and surface
elimination, for verification by comparing synthetic  images from the
model  with perceived  images of  reality. There  are three  kinds of
locus solvers that compute numerical descriptions for cameras,  light
sources and physical objects.  Finally, there is of course a large
number (at least ten) of different compare processors for confirming or
denying correspondences among entities in each of the different kinds
of images and 3-D models.

[6.2	Vision Tasks.]

	The 3-D  vision research problem  being discussed is  that of
finding  out how to  write programs that  can see in  the real world.
Related vision problems  include: modeling  human
perception,  solving visual  puzzles (non-real world), and developing
advanced automation  techniques (ad hoc vision). In order to approach
the problem,  specific  programming tasks are proposed  and solutions
are sought; however, please distinguish the idea of a research problem
from that of a programming task. As will be illustrated, many  vision
tasks can be  done without vision.   The vision solution to  be found
should  be  able  to  deal with  real  images,    should include  the
continuity of the  visual process in  time and  space, and should  be
general  purpose  rather  than  ad hoc.    These  three  requirements
(reality,   continuity,   generality) will be  developed by surveying
six examples of computer vision tasks.

	First, there is the robot chauffeur task.  In 1969, John
McCarthy asked  me to consider the vision  requirements of a computer
controlled car such as he depicted  in an unpublished essay. The  idea
is that a user of such an automatic  car would request a destination;
the  robot would select a  route from an internally  stored road map;
and it would then proceed  to its destination using visual data.  The
problem  involves  representing the  road  map  in the  computer  and
establishing the correspondence between the map and the appearance of
the road as the automatic chauffeur drove the vehicle along the
selected route.  Lacking a  computer controlled car,  the problem was
abstracted to that of tracing a route along the driveways and parking
lots that  surround the Stanford  A.I. Laboratory using  a television
camera and  transmitter mounted on a  radio controlled electric cart.
The robot chauffeur task could be solved by non-visual means, such as
by railroad-like guidance or by inertial guidance; to preserve the
vision aspect of the problem, no particular artifacts should be
required along a route (landmarks must be found, not placed); and the
extent of inertial dead reckoning should be noted.

	Second, there is the task of a robot explorer.  In 1967,
McCarthy and Lederberg published a description of a robot for
exploring the  surface of the  planet Mars.   The robot  explorer was
required to  run for long periods of  time without human intervention
because the signal transmission  time to Mars  is as great as  twenty
minutes and because the 24.6 hour Martian day would place the vehicle
out of Earth sight for twelve hours at a time. (This latter difficulty
could be  avoided at  the expense of  having a  set of  communication
relay  satellites in  orbit around  Mars). The  task of  the explorer
would be to  drive around mapping  the surface of  Mars, looking  for
interesting features,  and doing various experiments.  To be prudent,
a Mars  explorer should be able to  navigate without vision; this can
be done  by  driving slowly  and by  using  a tactile  collision  and
crevasse detector.  If the  television system fails, the core samples
and  so on can still be collected  at different Martian sites without
unusual risk to the vehicle due to visual blindness.

	The third vision  task is that  of the robot soldier,   tank,
sentry, pilot or  policeman.  The problem has several forms which are
quite similar to the chauffeur  and the explorer with the  additional
goal of doing something nasty to an enemy.  Although this vision task
has not yet been explicitly attempted at Stanford,  to the best of my
knowledge, the reader should be warned that a thorough solution to
any of the other tasks almost assures the Orwellian technology needed to
solve this one.

	Fourth, the turntable task is to construct a 3-D model from
a sequence of 2-D television images taken of an object rotated on a
turntable.  The turntable task was selected as a simplification of
the explorer task and is an example of a nearly pure descriptive
vision task.

	Fifth, the classic blocks vision task consists of two parts:
first, convert a video image into a line drawing; second, make a
selection from a set of predefined prototype models of blocks that
accounts for the line drawing.  In my opinion, this vision task
emphasizes three pitfalls: single image vision, line drawings and
blocks.  The greatest pitfall, in the usual blocks vision task, is the
presumption that a single image is to be solved, thus diverting
attention away from the most important depth perception mechanism,
which is parallax.  The second pitfall is that the usual notion of a
perspective line drawing is not a natural intermediate state, but is
rather a very sophisticated and platonic geometric idea.  The perfect
line drawing lacks photometric information; even a line drawing with
perfect shadow lines included will not resemble anything that can
readily be gotten by processing real television pictures.  Curiously,
the lack of success in deriving line drawings from real television
images of real blocks has not dampened interest in solving the second
part of the problem.  The perfect line drawing puzzle was first
worked on by Guzman and extended to perfect shadows by Waltz;
nevertheless, enough remains so that the puzzle will persist on its
own merits, without being closely relevant to real world computer
vision.  Even assuming that imperfect line drawings are given, the
final unreality of the blocks themselves has seduced such
researchers as Falk and Grape into building byzantine systems of
vertex-edge classification which almost certainly can not be extended
beyond the blocks domain.  Actually, the blocks would not be such a
bad research simplification if researchers could avoid getting hung
up on the fact that they have edges and vertices, and concentrate
instead on where the blocks are and on how they scatter light.  The
blocks task can be rehabilitated by requiring photometric modeling
and by requiring the use of multiple images for depth perception.

	Sixth, the Stanford  Hand Eye Project has  recently dedicated
itself  to  solving  the task  of  automatic  machine  assembly.   In
particular, the group will try to develop techniques that will be
demonstrated by the fully automatic assembly of a chain saw gasoline
engine.  The two pressing vision questions of machine assembly are
where is  the part and  where is  the hole;  these questions will  be
initially  handled by composing  ad hoc  part and hole  detectors for
each vision step required for the assembly.

	The point of this  task survey was to  sharpen our taste  for
what is and is not a task requiring real 3-D vision; and to point out
that  caution has  to be taken  to preserve  the vision aspects  of a
given task. In the usual course of vision projects, a single  task or
a single tool unfortunately dominates the research; my work is no
exception: the one tool is 3-D modeling, and the task that dominated
the formative stages of the research is that of the robot chauffeur
cart.  A better understanding of the ultimate nature of computer
vision can be obtained by keeping  the several tasks and the  several
tools in mind.

~|---------------------------------------------------------------λ10;JAFA
BOX 6.4	TABLE OF 3-D COMPUTER VISION TASKS.

1. The Robot Chauffeur. Cart Task.
	Given a computer  controlled cart and a road map,
	drive the cart along a preselected route,
	without crashing into anything.

2. The Robot Explorer. Cart Task.
	Given a computer  controlled cart,
	explore and map the world,
	without crashing into anything.

3. The Robot Soldier. Cart Task.
	Given a computer controlled vehicle,
	locate and destroy the enemy.

4. Turn Table Task.
	The turntable task is to construct a 3-D model from a
	sequence of 2-D television images taken of an object
	rotated on a turntable.

5. The Blocks Task.
	First, convert a video image into a line drawing;
	Second, identify and locate the blocks in the line drawing.

6. Machine Assembly Tasks.
	Where is the part ? Where is the hole ?
	Location Task:	Where is it.
	Identification Task: What is it.
~|---------------------------------------------------------------λ30;JUFA
[6.3	The Nature of Images.]

	An  image is  a  2-D spatial  data  structure representing  a
rectangle from the pattern of light formed by a thin lens. A sequence
of images in time is called  a film.  For the present design  theory,
there are three basic kinds  of information in an image: photometric,
geometric,    and topological;  also  there  are three  kinds  of 2-D
images: raster,  contour,  and mosaic.

	Geometry fundamentally has  to do with distance  measure. The
geometry of an image is based on coordinate pairs that are associated
with the  elements that  form the  image. From  the coordinates  such
geometric properties  as length,   area,   angle  and moments  can be
computed. Photometry has to do with light measure.  Although physical
measurements  of   light  may  include   power,  hue,     saturation,
polarization and phase; only the relative power between points of the
same image is  easily available  to the computer  using a  television
camera. The acquisition of color images is possible at Stanford by
means of color filters; however,  color does not significantly change
the vision problem  at hand. Topology has  to do with  neighborhoods,
what is next to what.  Topological data may be explicitly represented
by  pointers between related entities,   or implicitly represented by
the format of the data.
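
	From such coordinate pairs the geometric properties mentioned
above reduce to short computations.  As a minimal sketch in modern
notation (my illustration, not part of the original system), the
shoelace formula gives the area and centroid of a polygonal image
region:

    def polygon_area_and_centroid(vertices):
        """Return (area, (cx, cy)) for a simple polygon of (x, y) pairs."""
        area2 = 0.0              # twice the signed area
        cx = cy = 0.0
        n = len(vertices)
        for i in range(n):
            x0, y0 = vertices[i]
            x1, y1 = vertices[(i + 1) % n]
            cross = x0 * y1 - x1 * y0
            area2 += cross
            cx += (x0 + x1) * cross
            cy += (y0 + y1) * cross
        cx /= 3.0 * area2
        cy /= 3.0 * area2
        return area2 / 2.0, (cx, cy)

    # A 2 by 1 image rectangle has area 2 and centroid (1.0, 0.5).
    print(polygon_area_and_centroid([(0, 0), (2, 0), (2, 1), (0, 1)]))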

	A raster image is  a two dimensional integer valued  array of
pixels.  A  pixel,  "picture element", is a single sample position on
a vidicon. Although the real shape of a pixel is probably that of a
blunt ellipse, the fiction that pixels tessellate the image into
little rectangles will be adopted. For other theoretical purposes the
array is assumed to be formed by sampling and truncating a two
argument, smooth, infinitely differentiable real valued function; this
latter fiction will only occasionally be needed here.

	Given a raster image,  the classical  approach is to find the
features.  One feature, which rarely escapes the notice of even the
dullest student of the subject, is called the "edge". For a naive
start, an  edge can  be defined  as a  locus of  change in  the image
function;  the locus  set where  edges are  not found will  be called
"regions". Edges  and  regions are  two sides  of  the same  slippery
thing, which even the sharpest  students of vision have not yet fully
grasped. A sophisticated definition of the region/edge notion  should
include an effective procedure for  converting a raster approximation
of   an  image  function  into   a  region/edge  representation.  Two
region/edge structures of  particular value are  the contour map  and
the mosaic.
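
	A minimal sketch of the naive edge definition above (my
illustration, not the thesis edge operator): a pixel is marked as edge
wherever the local change in the image function exceeds a threshold,
and the unmarked remainder constitutes the regions.

    def edge_map(image, threshold):
        """image: 2-D list of brightness samples; returns a 0/1 edge mask."""
        rows, cols = len(image), len(image[0])
        edges = [[0] * cols for _ in range(rows)]
        for r in range(rows - 1):
            for c in range(cols - 1):
                dx = image[r][c + 1] - image[r][c]   # horizontal change
                dy = image[r + 1][c] - image[r][c]   # vertical change
                if dx * dx + dy * dy > threshold * threshold:
                    edges[r][c] = 1                  # locus of change
        return edges

    picture = [[10, 10, 200, 200],
               [10, 10, 200, 200],
               [10, 10, 200, 200]]
    print(edge_map(picture, 50))    # edges marked along the brightness step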

	A  contour image  is like  a  geodesic contour  map. In  such
contour images no two contours ever cross and all contours close. A
mosaic image (or tessellation) is like a ceramic tile mosaic. In such
a mosaic  image no two  regions ever overlap  and the whole  image is
explicitly covered with tiles.  Further useful restrictions might  be
made concerning whether it  is permitted to have tiles  with holes in
them, and whether it  is permitted for a tile to have points that are
thinner than a single pixel.
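
	As a small illustration (an assumption of mine, not from the
text), a mosaic stored as one region label per pixel satisfies both
defining properties automatically: every pixel carries exactly one
tile label, so no two tiles overlap and the whole image is covered.

    mosaic = [[1, 1, 2, 2],
              [1, 1, 2, 2],
              [3, 3, 3, 2]]

    def tile(mosaic, label):
        """Return the set of pixel coordinates belonging to one tile."""
        return {(r, c) for r, row in enumerate(mosaic)
                       for c, v in enumerate(row) if v == label}

    print(tile(mosaic, 3))    # the three pixels of tile number 3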

[6.4	The Nature of Worlds.]

	The physical information most directly  relevant to vision is
the location,  extent and light scattering properties of solid opaque
objects; the location,   orientation  and scales of  the camera  that
takes the  pictures; and the  location and  nature of the  light that
illuminates  the world.    The transformation  rules of  the everyday
world that  a  programmer  may assume,  a  priori,  are the  laws  of
Newtonian physics.  The arguments against geometric modeling divide
into two categories: the reasonable and the intuitive.

	The reasonable  arguments  attack 3-D  geometric modeling  by
comparing it to another modeling alternative (some alternatives are
listed in the box immediately below).  Actually, the domains
of greatest efficiency of the possible kinds of models do not
overlap, and an artificial intellect will have some portion of
each  kind.  Nevertheless, I  feel  that  3-D  geometric  modeling  is
superior for  the task at  hand, and that  the other models  are less
relevant to vision.

~|---------------------------------------------------------------λ10;JAFA
BOX 6.5	Alternatives to 3-D Geometric Modeling in a Vision System.
		1. Image memory, with only the camera modeled in 3-D.
		2. Statistical world model, e.g. Duda & Hart.
		3. Procedural knowledge, e.g. Hewitt & Winograd.
		4. Semantic knowledge, e.g. Wilks & Schank.
		5. Formal logic models, e.g. McCarthy & Hayes.
		6. Syntactic models.
~|---------------------------------------------------------------λ30;JUFA

	The best alternative  to a 3-D  geometric model is to  have a
library of little 2-D images describing the appearance of various 3-D
loci  from  given  directions.    The  advantage  would  be   that  a
sophisticated image predictor would not be required; on the other
hand, the image library is potentially quite large, and even with
a huge data base new views and lighting of familiar objects and
scenes can not be anticipated.

	The statistical model is quite relevant to vision and can be
added to the  geometric model. However, the statistical model can not
stand  alone because  the  processes  of  occultation,  rotation  and
illumination make the approach infeasible.

	Procedural knowledge models  represent the world in  terms of
routines (or actors) which either know or can compute the answer to a
question about the world. Semantic models represent the world in terms
of a data structure of conceptual statements; and formal logic models
represent the  world in terms of first order predicate calculus or in
terms of a  situation calculus. The  procedural, semantic and  formal
logic world  models are of course  all general enough  to represent a
vision model and in a theoretical sense they are just other notations
for  3-D  geometric modeling.    However  in  practice,  these  three
modeling   regimes  are  not   efficient  holders   and  handlers  of
quantitative geometric  data; but are  rather intended  for a  higher
level  of  abstract reasoning.  Another  alleged  advantage of  these
higher  models  is  that they  can  represent  partial  knowledge and
uncertainty,  which  in  a  geometric  model  is   implicit, in  that
structures are missing or incomplete.  For example, McCarthy and
Feldman demand that when a robot has only seen the front of an office
desk, the model should be able to draw inferences about the back
of the desk; I feel that this so-called advantage is not required by
the problem and that basic visual modeling is on a more agnostic
level.

	The syntactical  approach to  descriptive vision  is that  an
image is a sentence of a picture grammar and that consequently the
image description should be given in terms of the sequence of grammar
transformation rules. Again this paradigm is theoretically true but
impractical for real images of 3-D objects, because simple
replacement rules can not readily express rotation, perspective,
and photometric  transformations. On the other  hand, the syntactical
models have been of  some use in describing  2-D shapes. (Gipps,  74;
Freeman 65).

	The intuitive arguments  include the opinions  that geometric
modeling is too numerical, too exact, or too non-human to be relevant
for computer vision research.  Since I suffer a lack of sympathy
for these positions, I will forsake any pretext of objectivity and
attack them as fallacies.

	The natural mimicry fallacy lies in insisting that a machine
must mimic nature in order to achieve its design
goals. Boeing 747's are not covered with feathers; trucks do not have
legs; and computer vision need not simulate human vision.  The
advocates of the uniqueness of natural intelligence and perception
will have to come up with a rather unusual uniqueness proof to establish their
conjecture.  In the meantime, one should be open minded about the
potential forms a perceptive consciousness can take.

	The self introspection fallacy lies in insisting that
one's introspections about how he thinks and sees are direct
observations of thought and sight. By introspection some conclude
that the visual models (even on a low level) are essentially
qualitative rather than quantitative.  My belief is that the vision
processing of the  brain is quite  quantitative and only passes  into
qualities  at a  higher level  of the  process. In  either  case, the
details of human visual processing are inaccessible to conscious self
inspection.

	Although I think that the above two fallacies of intuition
generate a prejudice against numerical models, convincing a person of
these fallacies doesn't seem to remove his doubts.  Some important
argument or idea is missing that would convince the so prejudiced
potential vision worker of the importance of numerical models prior
to the full achievement of computer vision (and vice versa, I have not
heard an argument that would change my prejudice in favor of such
models).  This matter of conflicting intuitions would not be
important, were it not that the doubters include so many of my
immediate colleagues.  (Of course, I may well be proved wrong
if really powerful 3-D computer vision systems are ever built
without using any geometric models worth speaking of).

[6.5	Locus Solving.]

	The crux  of computer vision  is to deduce  information about
the world being  viewed from images of that world. Accordingly, three
main descriptive vision problems are camera solving, body solving and
sun solving.

~|-----------------------------------------------------λ10;JAFA
		Box 6.6  Three Kinds of Locus Solving:
			1. Camera Locus Solving.
			2. Body Locus Solving.
			3. Sun Locus Solving.
~|-----------------------------------------------------λ30;JUFA

	Camera solving  is routinely done  in two  ways: in a  single
image the image loci  of a set of known world loci (perhaps points on
a calibration object) are measured and a camera model computed; or in
two images a  set of corresponding landmark feature  points are found
and the  whole system is solved.  To calibrate a camera with a single
image requires knowing something about the world a priori; calibrating
a  camera with multiple  images can  solve for everything  except the
"true" scale and origin of the world.

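	For concreteness, a hedged sketch of single image camera
solving in modern notation (a standard linear least squares
formulation of my choosing, not the solver actually used here): given
the image loci of six or more known world loci, a 3 by 4 projection
matrix is recovered as the null vector of a linear system.

    import numpy as np

    def solve_camera(world_pts, image_pts):
        # Build the 2N x 12 system A p = 0 from point correspondences.
        rows = []
        for (X, Y, Z), (u, v) in zip(world_pts, image_pts):
            rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
            rows.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
        _, _, vt = np.linalg.svd(np.asarray(rows))
        return vt[-1].reshape(3, 4)      # null vector of A, up to scale

    def project(P, point):
        x = P @ np.append(point, 1.0)
        return x[:2] / x[2]

    # Check on synthetic data: project invented landmarks through an
    # assumed camera P_true, then recover a camera from the matches.
    P_true = np.array([[800.0, 0.0, 320.0, 10.0],
                       [0.0, 800.0, 240.0, 20.0],
                       [0.0, 0.0, 1.0, 5.0]])
    world = np.array([[0, 0, 1], [1, 0, 2], [0, 1, 3], [1, 1, 1],
                      [2, 1, 2], [1, 2, 3], [2, 2, 4]], dtype=float)
    image = np.array([project(P_true, w) for w in world])
    P = solve_camera(world, image)
    print(project(P, [1.5, 0.5, 2.0]), project(P_true, [1.5, 0.5, 2.0]))
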
	After the camera positions are known, the location and extent
of  the objects  composing  the scene  can be  found  using parallax.
Parallax is  the  principal means  of  depth  perception and  is  the
alchemist that converts 2-D images into 3-D models.
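
	A minimal sketch of parallax as depth perception (my
illustration, with invented numbers): for a rectified pair of views a
known baseline apart, the shift of a feature between the two images
fixes its distance.

    def depth_from_parallax(focal_px, baseline, u_left, u_right):
        """Rectified stereo pair: depth = focal * baseline / disparity."""
        disparity = u_left - u_right      # image-plane shift in pixels
        return focal_px * baseline / disparity

    # A feature shifted 16 pixels, with f = 800 pixels and a 0.5 metre
    # baseline, lies 25 metres from the cameras.
    print(depth_from_parallax(800.0, 0.5, 416.0, 400.0))    # -> 25.0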
	
	After  the camera  and  object positions  are  known to  some
accuracy,   the nature and  location of light sources  can be deduced
from the  shines and  shadows.   On  the other  hand,   with  outdoor
situations the sun position can be rather accurately predicted with a
simple ephemeris, which relieves the body solver of some of the
burden of filtering out photometric effects.
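
	A crude sketch of such a simple ephemeris (the approximation
and the numbers are my own choices, good to roughly a degree): given
the latitude, day of year and local solar time, the sun's elevation
follows from its declination and hour angle.

    import math

    def sun_elevation(latitude_deg, day_of_year, solar_hour):
        decl = -23.44 * math.cos(math.radians(360.0 / 365.0 * (day_of_year + 10)))
        hour_angle = 15.0 * (solar_hour - 12.0)        # degrees from local noon
        lat, dec, ha = map(math.radians, (latitude_deg, decl, hour_angle))
        sin_alt = (math.sin(lat) * math.sin(dec) +
                   math.cos(lat) * math.cos(dec) * math.cos(ha))
        return math.degrees(math.asin(sin_alt))

    # Stanford (latitude about 37.4 N) at noon on the summer solstice:
    # roughly 76 degrees above the horizon.
    print(round(sun_elevation(37.4, 172, 12.0), 1))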

[6.6	Comparing.]

	The compare process  is the keystone for completing  the arch
of  a visual feedback  system;  however the compare  need not require
much study if the entities  to be mated can be highly  individualized
so that there is either a monogamous match or nothing. (Although the
study of sophisticated comparing is important, in a descriptive
approach to vision, when a simple compare doesn't work, one censures
the object representation rather than the match processor.)
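
	A small sketch of the monogamous match idea (my illustration,
using invented scalar stand-ins for image features): two entities are
mated only when each is the other's single nearest neighbor under
some distance; otherwise they match nothing.

    def monogamous_matches(left, right, distance):
        # Pair i with j only when each is the other's nearest neighbor.
        pairs = []
        for i, a in enumerate(left):
            j = min(range(len(right)), key=lambda k: distance(a, right[k]))
            i_back = min(range(len(left)), key=lambda k: distance(left[k], right[j]))
            if i_back == i:
                pairs.append((i, j))
        return pairs

    # 7.2 and 7.0 mate, 3.9 and 4.1 mate, 1.0 and 30.0 are left unmatched.
    print(monogamous_matches([1.0, 7.2, 3.9], [7.0, 4.1, 30.0],
                             lambda a, b: abs(a - b)))    # [(1, 0), (2, 1)]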

	Three  important  compares  can  be characterized  as  verify
compare, reveal compare  and recognize compare.   The verify  compare
involves finding the corresponding entities between a predicted image
and a perceived  image for the sake of camera calibration and for the
sake of eliminating known landmark features from the revelation
process.  The reveal compare involves finding the corresponding
entities in two perceived images, so that the location and extent of
new landmark objects can be solved.  Finally, the recognition
compare involves mating a perceived entity with one of a set of known
model entities; recognition is done both in the 2-D image domain and
in the domain of 3-D object models, although the latter will receive
more attention.

	Two  problems of description comparing  are normalization and
segmentation.     Normalization   involves   eliminating   irrelevant
differences such as location,  orientation and lighting. Segmentation
involves  subdividing a complex object into  pieces, so that only the
simple small pieces need be matched.
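
	A small sketch of normalization (my own illustration): a
described shape is translated to its centroid and scaled to unit
size, so that location and overall size drop out before two
descriptions are compared.

    def normalize(points):
        n = len(points)
        cx = sum(x for x, _ in points) / n
        cy = sum(y for _, y in points) / n
        centered = [(x - cx, y - cy) for x, y in points]
        scale = max(max(abs(x), abs(y)) for x, y in centered) or 1.0
        return [(x / scale, y / scale) for x, y in centered]

    # Two squares differing only in position and size normalize identically.
    print(normalize([(0, 0), (2, 0), (2, 2), (0, 2)]) ==
          normalize([(5, 5), (9, 5), (9, 9), (5, 9)]))    # -> True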

[6.7	Mobile Robot Vision.]

	The elements  discussed so far  will now be  brought together
into a system design for performing mobile robot vision. The proposed
system is illustrated below in the block diagram. Although the robot
chauffeured cart was the main task theme of this research, I have
failed to date, March 1974, to achieve the hardware and software
required to drive the cart around the laboratory under its own
control. Nevertheless, this necessarily theoretical cart system has
been of considerable use in developing the visual 3-D modeling
routines and theory, which are the subject of this thesis.

~|----------------------------------------------------------λ10;JUF1
FIGURE 6.2	Cart Vision Mandala.

 →→→→→→→→→→→→→→→→→→→ PERCEIVED →→→→→→ REALITY →→→→→→ PREDICTED →→→→
 ↑	               WORLD         SIMULATOR         WORLD      ↓
 ↑  								  ↓
 ↑								  ↓
 ↑                   PERCEIVED →→→→→→  CART →→→→→→→→ PREDICTED →→→↓
 ↑	            CAMERA LOCUS      DRIVER        CAMERA LOCUS  ↓
 ↑	                ↑		↓		   	  ↓
 ↑	                ↑		↓		   	  ↓
 ↑                      ↑	      THE CART	     PREDICTED→→→→↓
BODY                 CAMERA			     SUN LOCUS 	  ↓
LOCUS		     LOCUS				 	  ↓
SOLVER		     SOLVER				          ↓
 ↑			↑				          ↓
 ↑			↑			 	          ↓
REVEAL 	             VERIFY				       IMAGE  
COMPARE		     COMPARE				 SYNTHESIZER
 ↑   ↑	 	      ↑   ↑				          ↓
 ↑   ↑                ↑   ↑ 				          ↓
 ↑   ←←	PERCEIVED→→→→→↑   ↑←←←←←←←←←←←←←←←←←←←←	PREDICTED  ←←←←←←←↓
 ←←←←← MOSAIC IMAGE			      MOSAIC IMAGE        ↓
	   ↑					   ↑	          ↓
	   ↑					   ↑	          ↓
	   ↑					   ↑              ↓
	PERCEIVED			        PREDICTED         ↓
      CONTOUR IMAGE			      CONTOUR IMAGE       ↓
	   ↑					   ↑ 	          ↓
	   ↑					   ↑	          ↓
	   ↑					   ↑	          ↓
	PERCEIVED				PREDICTED ←←←←←←←←←
       VIDEO IMAGE			       VIDEO IMAGE
	   ↑
	   ↑
	   ↑
       TELEVISION
	 CAMERA
~|-----------------------------------------------------------λ30;JUFA

	The robot chauffeur task involves establishing the
correspondence between an internal road map and the appearance of the
road in order to steer a vehicle along a predefined path. For a first
cut, the planned route  is assumed to be clear, and  the cart and the
sun  are assumed  to be the  only movable  things in  a static world.
Dealing with moving obstacles is a second problem; motion through a
static world must be dealt with first.

	The cart  at the Stanford  Artificial Intelligence Laboratory
is intended for outdoor use and consists of a piece of plywood, four
bicycle wheels, six electric motors, two car batteries,  a television
camera,   a television transmitter, a box of  digital logic, a box of
relays,   and a  toy airplane  radio receiver.    (The vehicle  being
discussed is not "Shakey", which belongs to the Stanford Research
Institute's Artificial Intelligence Group.  There are two A.I. labs
near Stanford and  each has a  computer controlled vehicle). The  six
possible cart actions are: run forwards,  run backwards, steer to the
left,  steer to the right, pan camera to the left,  pan camera to the
right.   Other than  the television  camera,   there is no  telemetry
concerning the state of the cart or its immediate environment.

	The solution to the cart problem begins with the cart at a
known  starting position  with a  road map  of visual  landmarks with
known loci. That is,  the upper leftmost  two rectangles of the  cart
mandala  are initialized  so that  the perceived  cart locus  and the
perceive  world correspond with  reality.  Flowing across  the top of
the mandala, the  cart driver, blindly  moves the cart forward  along
the desired route by dead reckoning (say the cart moves five feet and
stops) and the driver updates the predicted cart locus.  The  reality
simulator is  an identity in  this simple case  because the  world is
assumed static.  Next the image synthesizer uses the predicted world,
camera and sun to compute a predicted image containing the landmark
features expected to be in view.  Now, in the lower left of the
mandala,  the cart's television camera takes  a perceived picture and
(flowing upwards) the picture  is converted into a form  suitable for
comparing and  matching with the  predicted image. Features  that are
both predicted  and perceived  and found  to match  are used  by  the
camera locus  solver to compute  a new  perceived camera locus  (from
which  the cart locus can  be deduced). Now the  cart driver compares
the perceived and  the predicted cart locus  and corrects its  course
and moves the cart again, and so on.

~| ----------------------------------------------------λ10;JAFA
BOX 6.7	Chauffeur Cart Task Solution.

 	1. Predict (or retrieve) 2D image features.
	2. Perceive (take) a television picture and convert.
	3. Compare (verify)  predicted and perceived features.
	4. Solve for camera locus.
	5. Servo the cart along its intended course.
~| ----------------------------------------------------λ30;JUFA
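
	A toy, runnable sketch of this loop (all names and numbers are
mine; image prediction, picture taking and the verify compare are
collapsed into one simulated locus measurement) shows how the visual
correction keeps dead reckoning drift from accumulating:

    import random

    def drive_route(waypoints, step=5.0):
        # believed_x is the dead-reckoned cart locus; true_x drifts.
        true_x = believed_x = 0.0
        for waypoint in waypoints:
            while believed_x < waypoint:
                true_x += step * random.uniform(0.9, 1.1)          # blind move, with drift
                believed_x += step                                 # dead-reckoned prediction
                perceived_x = true_x + random.uniform(-0.2, 0.2)   # stands in for steps 1-4
                believed_x = perceived_x                           # step 5: servo on the error
        return true_x, believed_x

    print(drive_route([15.0, 40.0, 60.0]))   # both loci end near the last waypoint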

	The remaining limb of the cart mandala is invoked in order to
turn the chauffeur into an explorer.  Perceived images are compared
thru time by the reveal compare  and new features are located by  the
body locus solver and placed into the world model.

	Now the generality and feasibility of such a cart system
depend almost entirely on the representation of the world and the
representation of image features.  (The more general, the less
feasible). Although the bulk of the rest of this document develops
a polyhedral representation for the sake of photometric generality,
four simpler cart systems could be realized by using simpler models.

	A first system consists of a road map, a road model, a road
model generator, a solar ephemeris, an image predictor, an image
comparator, a camera locus solver, and a course servo routine.  The
roadways and nearby environs are entered into the computer. In fact,
real roadways are constructed from a two dimensional X,Y alignment
map showing the way the center of the road goes, as a curve composed of
line segments and circular arcs; and from a second, two dimensional S,Z
elevation diagram showing the height of the surface above sea level
as a function of distance along the road, as a sequence of linear
grades and vertical arcs which (not too surprisingly) are nearly cubic
splines; a simplified sketch of this road model is given below. A
second version is like the first except the road model,
road model generator, and image predictor are replaced by a library
of road images.  In this system the robot vehicle is "trained" by
being driven down the roads it is supposed to follow. A third system
is like the first except that the road map is not initially given,
and indeed the road is no longer presumed to exist.  Part of the
problem becomes finding a road, a road in the sense of a clear area;
this version yields the cart explorer, and if the clear area is found
quite rapidly and the world is updated quite frequently, the explorer
can be a chauffeur that can handle obstacles and moving objects. The
fourth system is like the third, except that the world is modeled by
a single valued surface elevation function, rather than by a
polyhedral model.
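
	A simplified sketch of that first road model (my own
reduction, with invented numbers: the X,Y alignment is collapsed to a
polyline and the S,Z diagram to linear grades) gives the 3-D locus of
a point a distance s along the centerline:

    import math

    alignment = [(0.0, 0.0), (100.0, 0.0), (100.0, 50.0)]    # X,Y centerline
    elevation = [(0.0, 30.0), (80.0, 34.0), (150.0, 32.0)]   # S,Z profile

    def height(profile, s):
        # Linear interpolation of elevation between S,Z breakpoints.
        for (s0, z0), (s1, z1) in zip(profile, profile[1:]):
            if s0 <= s <= s1:
                return z0 + (z1 - z0) * (s - s0) / (s1 - s0)
        return profile[-1][1]

    def locus(s):
        # Walk the alignment polyline until distance s is used up.
        remaining = s
        for (x0, y0), (x1, y1) in zip(alignment, alignment[1:]):
            length = math.hypot(x1 - x0, y1 - y0)
            if remaining <= length:
                t = remaining / length
                return (x0 + t * (x1 - x0), y0 + t * (y1 - y0), height(elevation, s))
            remaining -= length
        return alignment[-1] + (height(elevation, s),)

    print(locus(120.0))   # (100.0, 20.0, about 32.9): 20 m up the second leg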

[6.8	Related Vision Work.]

	Larry Roberts is justly credited with doing the seminal work
in 3-D Computer Vision; although his thesis appeared over ten years
ago, the subject has languished, dependent on and overshadowed by the
four areas called: Image Processing, Pattern Recognition, Computer
Graphics,     and  Artificial   Intelligence.  Outside  the  computer
sciences, workers in psychology, neurology and philosophy also seek a
theory of vision.

	IMAGE  PROCESSING  involves  the  study  and  development  of
programs that enhance,   transform and compare 2D images.  Nearly all
image processing work can eventually be applied to computer vision in
various circumstances. A good survey of this field can be found in an
article  by Rosenfeld(69).   Image  PATTERN RECOGNITION  involves two
steps: feature extraction  and classification.  A  comprehensive text
about this field with respect to computer vision has been written by
Duda and Hart(73).  COMPUTER GRAPHICS is the inverse of descriptive
computer vision.  The problem of computer graphics is to synthesize
images from three dimensional models; the problem of descriptive
computer vision is to analyze images into three dimensional models.
An introductory  text book about this  field would be that  of Newman
and  Sproull(73). Finally, there is ARTIFICIAL INTELLIGENCE, which in
my opinion is an institution sheltering a heterogeneous group of
embryonic computer  subjects; the biggest of the  present day orphans
include: robotics,    natural language,    theorem proving,    speech
analysis, vision and planning. A more  narrow and relevant definition
of artificial intelligence is that it concerns the programming of the
robot task processor which sits above the vision system. There  is no
general  reference   on  Artificial  Intelligence  that   I  wish  to
recommend.

	The related vision work of specific individuals has already
been mentioned in context. To summarize, my vision work is related to:
Early: Roberts(63), Sutherland(63); Stanford: Falk, Feldman and
Paul(67), Tenenbaum(72), Agin(72), Grape(73); MIT: Guzman, Horn,
Waltz, Krakauer; UTAH: Warnock, Watkins; other places: SRI and JPL.

	Future progress in computer vision will  proceed in step with
better  computer  hardware, better  computer  graphics  software, and
better world modeling software.  Future vision work at Stanford
which is related to the present theory will be done by Lynn Quam and
Hans Moravec. At JPL and SRI, similar work on vehicle vision is
being done.

	The  machine  assembly  task is  being  pursued both  by  the
Artificial Intelligence Group  of the Stanford Research Institute and
by the Hand Eye  Project at Stanford  University. Because the  demand
for doing practical vision tasks can be satisfied with existing ad
hoc methods or by not using a visual sensor at all, I expect little
or no vision progress per se from such research, although their
demonstrations should be robotic spectaculars.

	Since the missing ingredient for computer vision is the
spatial modeling to which perceived images can be related, I believe
that the development of the technology for generating commercial film
and television by computer for entertainment will make a significant
contribution to computer vision.

[6.9 	Summary.]

	To recapitulate, three vision system design requirements were
postulated: reality,  generality,  and continuity. These requirements
were illustrated  by discussing  a number  of  vision related  tasks.
Next, a vision  system was described as mediating  between 2-D images
and  a world model;  with the  world model being  further broken down
into a  3-D geometric  model and a  task world  model. Between  these
entities  three  basic  vision  modes were  identified:  recognition,
verification  and  revelation  (description).  Finally,  the  general
purpose vision system was depicted as  a quantitative and description
oriented feedback cycle which maintains a 3-D geometric model for the
sake of higher qualitative,  symbolic, and recognition oriented  task
processors.

	Approaching the vision system in greater detail, the roles of
seven (or so) essential kinds of processors were explained: the task
processor, 3-D modeling routines, reality simulator, image
analyser, image synthesizer, comparators, and locus solvers. The
processors and data types were assembled into a cart chauffeur system.
Computer vision is related to (if not contained in) image processing,
pattern recognition,  computer graphics and artificial intelligence.